When Anthropic dropped Opus 4.6, we asked it to figure it out. From our eval logs, Opus 4.6 observed that for a single BigCodeBench prompt, Haiku and Opus often generated multiple code blocks. Furthermore, as the typo rate increases, Haiku switches from generating multiple code blocks to generating a single code block in ~20% of its responses.
I also have to say, I don’t know what you mean by “code block”. Does this mean a response?
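One way to pin down the term is to count Markdown fenced blocks inside each response. The sketch below is only an illustration under that assumption: it treats a "code block" as the text between a pair of triple-backtick lines, so one response can contain several. The helper names `count_code_blocks` and `single_block_fraction` are hypothetical and are not taken from the eval code.

```python
import re

# Built indirectly so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Assumption: a "code block" is a Markdown fenced block, i.e. the text between
# an opening triple-backtick line (optionally followed by a language tag)
# and the next line that consists of a closing triple-backtick.
BLOCK_RE = re.compile(
    rf"^{FENCE}[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def count_code_blocks(response: str) -> int:
    """Return the number of fenced code blocks in one model response."""
    return len(BLOCK_RE.findall(response))


def single_block_fraction(responses: list[str]) -> float:
    """Return the fraction of responses that contain exactly one fenced block."""
    if not responses:
        return 0.0
    single = sum(1 for r in responses if count_code_blocks(r) == 1)
    return single / len(responses)
```

Computing `single_block_fraction` separately for each typo-rate bucket would show whether the share of single-block responses grows as the typo rate increases.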